Cluster analysis examples

#uncomment the following line if you haven't installed these packages yet
#install.packages(c('vegan', 'cluster', 'rioja', 'gplots', 'adjclust)) 

library('vegan')
library('cluster')
library('rioja')
library('gplots')
library('adjclust')

# also uses stats package, included with base R.

Load data

We can use the Samwell Cave and pollen datasets from yesterday:

load(file="data/scpdFauna_cleaned.RData") #object is scpdSM 
pollen12K <- read.csv(file="data/12000.interp.data_clean_sites.csv")

Cluster analyses: perform clustering

Step 1: Create dissimilary matrix

We will apply three different distance measures to our data: euclidean, bray-curtis, and chao.

scpdSM_euc <- vegdist(scpdSM, method="euclidean")
scpdSM_bray <- vegdist(scpdSM, method="bray")
scpdSM_chao <- vegdist(scpdSM, method="chao")
# print(scpdSM_euc)
# print(scpdSM_bray)
# print(scpdSM_chao)

Step 2: cluster samples according to different algorithms

?hclust
clust_euc_ward.D2 <- hclust(scpdSM_euc, method="ward.D2")
clust_euc_neighbor <- hclust(scpdSM_euc, method="single")
clust_euc_upgma <- hclust(scpdSM_euc, method="average") #UPGMA
clust_euc_wpgma <- hclust(scpdSM_euc, method="mcquitty") #WPGMA
clust_bray_ward.D2 <- hclust(scpdSM_bray, method="ward.D2")
clust_bray_neighbor <- hclust(scpdSM_bray, method="single")
clust_bray_upgma <- hclust(scpdSM_bray, method="average") #UPGMA
clust_bray_wpgma <- hclust(scpdSM_bray, method="mcquitty") #WPGMA
clust_chao_ward.D2 <- hclust(scpdSM_chao, method="ward.D2")
clust_chao_neighbor <- hclust(scpdSM_chao, method="single")
clust_chao_upgma <- hclust(scpdSM_chao, method="average") #UPGMA
clust_chao_wpgma <- hclust(scpdSM_chao, method="mcquitty") #WPGMA

Cluster analyses: Plot in various ways

Plot simple dendograms

plotting_objs <- ls()[grep("clust", ls())]
print(plotting_objs)
 [1] "clust_bray_neighbor" "clust_bray_upgma"    "clust_bray_ward.D2"  "clust_bray_wpgma"   
 [5] "clust_chao_neighbor" "clust_chao_upgma"    "clust_chao_ward.D2"  "clust_chao_wpgma"   
 [9] "clust_euc_neighbor"  "clust_euc_upgma"     "clust_euc_ward.D2"   "clust_euc_wpgma"    
# Note: I'm writing the cluster diagram to a pdf file, so it's easier to read. You can uncomment the pdf() and dev.off() lines to print to the screen, or just not copy those lines
pdf(file="output/cluster_diagrams.pdf", height=11, width=8.5)
  par(mfcol=c(4,3), mar=c(2,4,1,1)+.01)
  for (i in 1:length(plotting_objs)){
    plot(get(plotting_objs[i]), cex=0.75, xlab=NA, sub=NA, main=plotting_objs[i], hang=-.1)
  }
dev.off()
null device 
          1 

Plot heat maps of the results

dissimColors <- gray.colors(30, start = 0.9, end = 0.1, gamma = 2.2, alpha = NULL)
ages <- as.numeric(rownames(scpdSM))
ageColors <- rainbow(length(ages), start=0, end=4/6)
heatmap(scpdSM, col=dissimColors, distfun=function(m) vegdist(m,method="bray"), hclustfun=function(m) hclust(m, method="average"), RowSideColors = ageColors)

Note 1: I’ve added some colors to the age labels. If adjacent strata clustered together, the colors would be ordered in a gradient. Note 2: in this case the dendrogram for the taxon names is meaningless, so let’s suppress it. But we can’t do this with the base heatmap function. We need to use a function in a different package: gplots

heatmap.2(scpdSM, dendrogram='row', col=dissimColors, distfun=function(m) vegdist(m,method="bray"), hclustfun=function(m) hclust(m, method="average"), RowSideColors = ageColors, tracecol=NA)

Cluster analysis: Constrained clustering

Coniss clustering

See the reference paper in the refs folder (Grimm 1987) for more info on the CONISS implementation. Note, here we have compared the data to a broken-stick distribution to estimate the minimum number of significant clusters. A broken-stick distribution is simple to calculate and the expected values of the pieces of the broken stick are given by Equation 1: Ej = 1/n * (sum(from x=j to n) 1/x) Ej is the expected value of the jth piece, and n is the number of pieces (breaks in this case)

constrained <- chclust(scpdSM_bray, method="coniss")
bstick <- bstick(constrained, 10)

ngroups <- min(which(bstick$dispersion>bstick$bstick))  # in this case, no clusters should be retained because even with two groups, the broken stick values are higher than the observed Sum of Squares.
x <- strat.plot(scpdSM, yvar=ages, title="", y.rev=T, cex.xlabel=0.5,cex.yaxis=0.5, cex.axis=0.5, clust=constrained) 
addClustZone(x, constrained, col="red", ngroups)

Add this constrained clustering dendrogram to the heatmap instead of the default hclust dendro. Note that now the colors/ages are ordered.

consDendro <- as.dendrogram(constrained)
heatmap.2(scpdSM, Rowv= consDendro, col=dissimColors, distfun=function(m) vegdist(m,method="bray"), RowSideColors = ageColors, tracecol=NA, dendrogram="row")

Alternate packages for constrained clustering

I ran across this package for constrained clustering: adjclust. See this website https://github.com/rstats-gsoc/gsoc2017/wiki/Constrained-Hierarchical-Agglomerative-Clustering for an in-depth discussion

fit1 <- adjClust(scpdSM_bray, type="dissimilarity")
Note: input class is 'dist' so 'type' is supposed to be 'dissimilarity'
Note: 1 merges with non increasing heights.
plot(fit1, leaflab="perpendicular")

Detected reversals in dendrogram: mode = 'corrected', 'within-disp' or 'total-disp' might be more relevant.

Exercises

  1. Generate some example data, calculate dissimilarity, and then hand-calculate the dendrogram using one clustering algorithm
  2. Explore cluster analysis using the fossil pollen data.
LS0tCnRpdGxlOiAiQW5hbHl0aWNhbCBQYWxlb2Jpb2xvZ3kgVHV0b3JpYWxzIDIwMTgiCnN1YnRpdGxlOiAiQ2x1c3RlciBBbmFseXNpcyIKYXV0aG9yOiAiSmVzc2ljYSBCbG9pcyIKZGF0ZTogIkF1Z3VzdCAyLCAyMDE4IChsYXN0IHVwZGF0ZWQ6IE9jdG9iZXIgMjksIDIwMTgpIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazogZGVmYXVsdAogIHBkZl9kb2N1bWVudDogZGVmYXVsdAotLS0KCiMjIENsdXN0ZXIgYW5hbHlzaXMgZXhhbXBsZXMKCmBgYHtyIHBhY2thZ2VfbG9hZGluZywgbWVzc2FnZT1GLCB3YXJuaW5nPUZ9CiN1bmNvbW1lbnQgdGhlIGZvbGxvd2luZyBsaW5lIGlmIHlvdSBoYXZlbid0IGluc3RhbGxlZCB0aGVzZSBwYWNrYWdlcyB5ZXQKI2luc3RhbGwucGFja2FnZXMoYygndmVnYW4nLCAnY2x1c3RlcicsICdyaW9qYScsICdncGxvdHMnLCAnYWRqY2x1c3QpKSAKCmxpYnJhcnkoJ3ZlZ2FuJykKbGlicmFyeSgnY2x1c3RlcicpCmxpYnJhcnkoJ3Jpb2phJykKbGlicmFyeSgnZ3Bsb3RzJykKbGlicmFyeSgnYWRqY2x1c3QnKQoKIyBhbHNvIHVzZXMgc3RhdHMgcGFja2FnZSwgaW5jbHVkZWQgd2l0aCBiYXNlIFIuCmBgYAoKIyMgTG9hZCBkYXRhCgpXZSBjYW4gdXNlIHRoZSBTYW13ZWxsIENhdmUgYW5kIHBvbGxlbiBkYXRhc2V0cyBmcm9tIHllc3RlcmRheToKCmBgYHtyIGxvYWRfZGF0YX0KbG9hZChmaWxlPSJkYXRhL3NjcGRGYXVuYV9jbGVhbmVkLlJEYXRhIikgI29iamVjdCBpcyBzY3BkU00gCnBvbGxlbjEySyA8LSByZWFkLmNzdihmaWxlPSJkYXRhLzEyMDAwLmludGVycC5kYXRhX2NsZWFuX3NpdGVzLmNzdiIpCmBgYAoKIyMgQ2x1c3RlciBhbmFseXNlczogcGVyZm9ybSBjbHVzdGVyaW5nCgojIyMgU3RlcCAxOiBDcmVhdGUgZGlzc2ltaWxhcnkgbWF0cml4CgpXZSB3aWxsIGFwcGx5IHRocmVlIGRpZmZlcmVudCBkaXN0YW5jZSBtZWFzdXJlcyB0byBvdXIgZGF0YTogZXVjbGlkZWFuLCBicmF5LWN1cnRpcywgYW5kIGNoYW8uCmBgYHtyIGRpc3NpbX0Kc2NwZFNNX2V1YyA8LSB2ZWdkaXN0KHNjcGRTTSwgbWV0aG9kPSJldWNsaWRlYW4iKQpzY3BkU01fYnJheSA8LSB2ZWdkaXN0KHNjcGRTTSwgbWV0aG9kPSJicmF5IikKc2NwZFNNX2NoYW8gPC0gdmVnZGlzdChzY3BkU00sIG1ldGhvZD0iY2hhbyIpCgojIHByaW50KHNjcGRTTV9ldWMpCiMgcHJpbnQoc2NwZFNNX2JyYXkpCiMgcHJpbnQoc2NwZFNNX2NoYW8pCmBgYAoKIyMjIFN0ZXAgMjogY2x1c3RlciBzYW1wbGVzIGFjY29yZGluZyB0byBkaWZmZXJlbnQgYWxnb3JpdGhtcwpgYGB7ciBwZXJmb3JtX2NsdXN0ZXJfYW5hbHlzaXN9Cj9oY2x1c3QKY2x1c3RfZXVjX3dhcmQuRDIgPC0gaGNsdXN0KHNjcGRTTV9ldWMsIG1ldGhvZD0id2FyZC5EMiIpCmNsdXN0X2V1Y19uZWlnaGJvciA8LSBoY2x1c3Qoc2NwZFNNX2V1YywgbWV0aG9kPSJzaW5nbGUiKQpjbHVzdF9ldWNfdXBnbWEgPC0gaGNsdXN0KHNjcGRTTV9ldWMsIG1ldGhvZD0iYXZlcmFnZSIpICNVUEdNQQpjbHVzdF9ldWNfd3BnbWEgPC0gaGNsdXN0KHNjcGRTTV9ldWMsIG1ldGhvZD0ibWNxdWl0dHkiKSAjV1BHTUEKCmNsdXN0X2JyYXlfd2FyZC5EMiA8LSBoY2x1c3Qoc2NwZFNNX2JyYXksIG1ldGhvZD0id2FyZC5EMiIpCmNsdXN0X2JyYXlfbmVpZ2hib3IgPC0gaGNsdXN0KHNjcGRTTV9icmF5LCBtZXRob2Q9InNpbmdsZSIpCmNsdXN0X2JyYXlfdXBnbWEgPC0gaGNsdXN0KHNjcGRTTV9icmF5LCBtZXRob2Q9ImF2ZXJhZ2UiKSAjVVBHTUEKY2x1c3RfYnJheV93cGdtYSA8LSBoY2x1c3Qoc2NwZFNNX2JyYXksIG1ldGhvZD0ibWNxdWl0dHkiKSAjV1BHTUEKCmNsdXN0X2NoYW9fd2FyZC5EMiA8LSBoY2x1c3Qoc2NwZFNNX2NoYW8sIG1ldGhvZD0id2FyZC5EMiIpCmNsdXN0X2NoYW9fbmVpZ2hib3IgPC0gaGNsdXN0KHNjcGRTTV9jaGFvLCBtZXRob2Q9InNpbmdsZSIpCmNsdXN0X2NoYW9fdXBnbWEgPC0gaGNsdXN0KHNjcGRTTV9jaGFvLCBtZXRob2Q9ImF2ZXJhZ2UiKSAjVVBHTUEKY2x1c3RfY2hhb193cGdtYSA8LSBoY2x1c3Qoc2NwZFNNX2NoYW8sIG1ldGhvZD0ibWNxdWl0dHkiKSAjV1BHTUEKYGBgCgojIyBDbHVzdGVyIGFuYWx5c2VzOiBQbG90IGluIHZhcmlvdXMgd2F5cwoKIyMjIFBsb3Qgc2ltcGxlIGRlbmRvZ3JhbXMKYGBge3IgcGxvdF9kZW5kcm99CnBsb3R0aW5nX29ianMgPC0gbHMoKVtncmVwKCJjbHVzdCIsIGxzKCkpXQpwcmludChwbG90dGluZ19vYmpzKQojIE5vdGU6IEknbSB3cml0aW5nIHRoZSBjbHVzdGVyIGRpYWdyYW0gdG8gYSBwZGYgZmlsZSwgc28gaXQncyBlYXNpZXIgdG8gcmVhZC4gWW91IGNhbiB1bmNvbW1lbnQgdGhlIHBkZigpIGFuZCBkZXYub2ZmKCkgbGluZXMgdG8gcHJpbnQgdG8gdGhlIHNjcmVlbiwgb3IganVzdCBub3QgY29weSB0aG9zZSBsaW5lcwpwZGYoZmlsZT0ib3V0cHV0L2NsdXN0ZXJfZGlhZ3JhbXMucGRmIiwgaGVpZ2h0PTExLCB3aWR0aD04LjUpCiAgcGFyKG1mY29sPWMoNCwzKSwgbWFyPWMoMiw0LDEsMSkrLjAxKQogIGZvciAoaSBpbiAxOmxlbmd0aChwbG90dGluZ19vYmpzKSl7CiAgICBwbG90KGdldChwbG90dGluZ19vYmpzW2ldKSwgY2V4PTAuNzUsIHhsYWI9TkEsIHN1Yj1OQSwgbWFpbj1wbG90dGluZ19vYmpzW2ldLCBoYW5nPS0uMSkKICB9CmRldi5vZmYoKQpgYGAKCiMjIyBQbG90IGhlYXQgbWFwcyBvZiB0aGUgcmVzdWx0cwoKYGBge3IgaGVhdG1hcH0KZGlzc2ltQ29sb3JzIDwtIGdyYXkuY29sb3JzKDMwLCBzdGFydCA9IDAuOSwgZW5kID0gMC4xLCBnYW1tYSA9IDIuMiwgYWxwaGEgPSBOVUxMKQphZ2VzIDwtIGFzLm51bWVyaWMocm93bmFtZXMoc2NwZFNNKSkKYWdlQ29sb3JzIDwtIHJhaW5ib3cobGVuZ3RoKGFnZXMpLCBzdGFydD0wLCBlbmQ9NC82KQpoZWF0bWFwKHNjcGRTTSwgY29sPWRpc3NpbUNvbG9ycywgZGlzdGZ1bj1mdW5jdGlvbihtKSB2ZWdkaXN0KG0sbWV0aG9kPSJicmF5IiksIGhjbHVzdGZ1bj1mdW5jdGlvbihtKSBoY2x1c3QobSwgbWV0aG9kPSJhdmVyYWdlIiksIFJvd1NpZGVDb2xvcnMgPSBhZ2VDb2xvcnMpCmBgYAoKTm90ZSAxOiBJJ3ZlIGFkZGVkIHNvbWUgY29sb3JzIHRvIHRoZSBhZ2UgbGFiZWxzLiBJZiBhZGphY2VudCBzdHJhdGEgY2x1c3RlcmVkIHRvZ2V0aGVyLCB0aGUgY29sb3JzIHdvdWxkIGJlIG9yZGVyZWQgaW4gYSBncmFkaWVudC4KTm90ZSAyOiBpbiB0aGlzIGNhc2UgdGhlIGRlbmRyb2dyYW0gZm9yIHRoZSB0YXhvbiBuYW1lcyBpcyBtZWFuaW5nbGVzcywgc28gbGV0J3Mgc3VwcHJlc3MgaXQuIEJ1dCB3ZSBjYW4ndCBkbyB0aGlzIHdpdGggdGhlIGJhc2UgaGVhdG1hcCBmdW5jdGlvbi4gV2UgbmVlZCB0byB1c2UgYSBmdW5jdGlvbiBpbiBhIGRpZmZlcmVudCBwYWNrYWdlOiBncGxvdHMKCmBgYHtyIGhlYXRtYXAuMn0KaGVhdG1hcC4yKHNjcGRTTSwgZGVuZHJvZ3JhbT0ncm93JywgY29sPWRpc3NpbUNvbG9ycywgZGlzdGZ1bj1mdW5jdGlvbihtKSB2ZWdkaXN0KG0sbWV0aG9kPSJicmF5IiksIGhjbHVzdGZ1bj1mdW5jdGlvbihtKSBoY2x1c3QobSwgbWV0aG9kPSJhdmVyYWdlIiksIFJvd1NpZGVDb2xvcnMgPSBhZ2VDb2xvcnMsIHRyYWNlY29sPU5BKQpgYGAKCiMjIENsdXN0ZXIgYW5hbHlzaXM6IENvbnN0cmFpbmVkIGNsdXN0ZXJpbmcKCiMjIyBDb25pc3MgY2x1c3RlcmluZwpTZWUgdGhlIHJlZmVyZW5jZSBwYXBlciBpbiB0aGUgcmVmcyBmb2xkZXIgKEdyaW1tIDE5ODcpIGZvciBtb3JlIGluZm8gb24gdGhlIENPTklTUyBpbXBsZW1lbnRhdGlvbi4gTm90ZSwgaGVyZSB3ZSBoYXZlIGNvbXBhcmVkIHRoZSBkYXRhIHRvIGEgYnJva2VuLXN0aWNrIGRpc3RyaWJ1dGlvbiB0byBlc3RpbWF0ZSB0aGUgbWluaW11bSBudW1iZXIgb2Ygc2lnbmlmaWNhbnQgY2x1c3RlcnMuICBBIGJyb2tlbi1zdGljayBkaXN0cmlidXRpb24gaXMgc2ltcGxlIHRvIGNhbGN1bGF0ZSBhbmQgdGhlIGV4cGVjdGVkIHZhbHVlcyBvZiB0aGUgcGllY2VzIG9mIHRoZSBicm9rZW4gc3RpY2sgYXJlIGdpdmVuIGJ5IEVxdWF0aW9uIDE6CkVqID0gMS9uICogKHN1bShmcm9tIHg9aiB0byBuKSAxL3gpCkVqIGlzIHRoZSBleHBlY3RlZCB2YWx1ZSBvZiB0aGUganRoIHBpZWNlLCBhbmQgbiBpcyB0aGUgbnVtYmVyIG9mIHBpZWNlcyAoYnJlYWtzIGluIHRoaXMgY2FzZSkKCgpgYGB7cn0KY29uc3RyYWluZWQgPC0gY2hjbHVzdChzY3BkU01fYnJheSwgbWV0aG9kPSJjb25pc3MiKQpic3RpY2sgPC0gYnN0aWNrKGNvbnN0cmFpbmVkLCAxMCkKbmdyb3VwcyA8LSBtaW4od2hpY2goYnN0aWNrJGRpc3BlcnNpb24+YnN0aWNrJGJzdGljaykpICAjIGluIHRoaXMgY2FzZSwgbm8gY2x1c3RlcnMgc2hvdWxkIGJlIHJldGFpbmVkIGJlY2F1c2UgZXZlbiB3aXRoIHR3byBncm91cHMsIHRoZSBicm9rZW4gc3RpY2sgdmFsdWVzIGFyZSBoaWdoZXIgdGhhbiB0aGUgb2JzZXJ2ZWQgU3VtIG9mIFNxdWFyZXMuCnggPC0gc3RyYXQucGxvdChzY3BkU00sIHl2YXI9YWdlcywgdGl0bGU9IiIsIHkucmV2PVQsIGNleC54bGFiZWw9MC41LGNleC55YXhpcz0wLjUsIGNleC5heGlzPTAuNSwgY2x1c3Q9Y29uc3RyYWluZWQpIAphZGRDbHVzdFpvbmUoeCwgY29uc3RyYWluZWQsIGNvbD0icmVkIiwgbmdyb3VwcykKYGBgCgpBZGQgdGhpcyBjb25zdHJhaW5lZCBjbHVzdGVyaW5nIGRlbmRyb2dyYW0gdG8gdGhlIGhlYXRtYXAgaW5zdGVhZCBvZiB0aGUgZGVmYXVsdCBoY2x1c3QgZGVuZHJvLiBOb3RlIHRoYXQgbm93IHRoZSBjb2xvcnMvYWdlcyBhcmUgb3JkZXJlZC4KCmBgYHtyIGhlYXRtYXAuMl9jb25zdHJhaW5lZH0KY29uc0RlbmRybyA8LSBhcy5kZW5kcm9ncmFtKGNvbnN0cmFpbmVkKQpoZWF0bWFwLjIoc2NwZFNNLCBSb3d2PSBjb25zRGVuZHJvLCBjb2w9ZGlzc2ltQ29sb3JzLCBkaXN0ZnVuPWZ1bmN0aW9uKG0pIHZlZ2Rpc3QobSxtZXRob2Q9ImJyYXkiKSwgUm93U2lkZUNvbG9ycyA9IGFnZUNvbG9ycywgdHJhY2Vjb2w9TkEsIGRlbmRyb2dyYW09InJvdyIpCmBgYAoKIyMjIEFsdGVybmF0ZSBwYWNrYWdlcyBmb3IgY29uc3RyYWluZWQgY2x1c3RlcmluZwoKSSByYW4gYWNyb3NzIHRoaXMgcGFja2FnZSBmb3IgY29uc3RyYWluZWQgY2x1c3RlcmluZzogYWRqY2x1c3QuICBTZWUgdGhpcyB3ZWJzaXRlIDxodHRwczovL2dpdGh1Yi5jb20vcnN0YXRzLWdzb2MvZ3NvYzIwMTcvd2lraS9Db25zdHJhaW5lZC1IaWVyYXJjaGljYWwtQWdnbG9tZXJhdGl2ZS1DbHVzdGVyaW5nPiBmb3IgYW4gaW4tZGVwdGggZGlzY3Vzc2lvbgoKYGBge3J9CmZpdDEgPC0gYWRqQ2x1c3Qoc2NwZFNNX2JyYXksIHR5cGU9ImRpc3NpbWlsYXJpdHkiKQpwbG90KGZpdDEsIGxlYWZsYWI9InBlcnBlbmRpY3VsYXIiKQpgYGAKCiMjIEV4ZXJjaXNlcwoKMS4gR2VuZXJhdGUgc29tZSBleGFtcGxlIGRhdGEsIGNhbGN1bGF0ZSBkaXNzaW1pbGFyaXR5LCBhbmQgdGhlbiBoYW5kLWNhbGN1bGF0ZSB0aGUgZGVuZHJvZ3JhbSB1c2luZyBvbmUgY2x1c3RlcmluZyBhbGdvcml0aG0KMi4gRXhwbG9yZSBjbHVzdGVyIGFuYWx5c2lzIHVzaW5nIHRoZSBmb3NzaWwgcG9sbGVuIGRhdGEuCiAgLSBQcm9kdWNlIGEgcG9sbGVuIGRpYWdyYW0gd2l0aCBhIGNvbnN0cmFpbmVkIGNsdXN0ZXIgZGlhZ3JhbSB1c2luZyB0aGUgZm9zc2lsIHBvbGxlbiBkYXRhIGZvciAxMiwwMDAgeWVhcnMgYWdvLiAKICAtIEhvdyBtYW55IGdyb3Vwcy96b25lcyBzaG91bGQgYmUgZGlzcGxheWVkPyA=